SOLUTION OF LARGE DENSE
TRANSPORTATION PROBLEMS USING A
PARALLEL PRIMAL ALGORITHM

by

Donald L. Miller, Joseph F. Pekny, and Gerald L. Thompson

Abstract

We implemented a version of the primal transportation code on a 14
processor BBN Butterfly computer and solved a variety of large, fully dense,
randomly generated transportation and assignment problems ranging in sizes up
to m = n = 3000. We found that the search phase of the primal transportation
algorithm was well suited for implementation on the Butterfly, but the pivot
phase could not be parallelized. Computational experience is presented
showing that speedup factors for a parallel over a sequential computer
of approximately 70 percent of the theoretical maximum were obtained. With
the parallel code the empirical difficulty of solving an n x n transportation
problem was proportional to $n^{\alpha}$, where $\alpha$ varied between 2.0
and 2.2.

1.  INTRODUCTION 

Among the most commonly solved and applied linear programs are
transportation problems and their close relations, assignment and network
problems. They have applications of their own and also serve as relaxations
of other problems, such as travelling salesman problems.

In  the  decade  of  the  1960’s,  the  belief  by  most  researchers  was  that  dual 
(Kuhn  [15])  and/or  primal-dual  (Ford-Fulkerson  [9])  methods  provided  the  most 
efficient  algorithms  for  this  class  of  problems.  This  belief  was  based  on 
limited  computation  performed  on  very  small  problems  by  hand  or  by  first 
generation  computers. 

In the early 1970's two groups of researchers, Srinivasan and Thompson
[21,22] and Glover, Karney, and Klingman [11], wrote codes using the primal
transportation algorithm, also called the stepping stone method or MODI
method; see Charnes and Cooper [4] and Dantzig [6]. They used some newly
invented data structures, as well as some provided in the computer science
literature, to greatly improve the efficiency of the resulting primal codes,
and concluded that primal transportation codes ran 100 to 200 times as fast as
ordinary linear programming simplex codes and about 50 times as fast as
primal-dual methods on the same problems. These conclusions are summarized in
Charnes, Karney, Klingman, Stutz and Glover [5]. Later Bradley, Brown and
Graves [3] came to the same conclusion for network problems. Because of the
computer memory limitations of that time, fully dense transportation problems
were solved for sizes up to about m = n = 200. For larger dimensions, only
sparse transportation problems having relatively few arcs (approximately
10n) were considered. Further improvements in data structures used by the
primal codes appeared in Barr, Glover and Klingman [1].

In the late 1970's and early 1980's there was a resurgence of interest in
dual codes. Auction and bidding dual codes were proposed by Bertsekas [2]
and Thompson [24]. Srinivasan and Thompson [23] proposed cost operator
algorithms. A new class of dual codes, using shortest augmenting paths, based
on the paper of Tomizawa [25] and improved by Dorhout [8], was developed.
Hung and Ram [13], Derigs [7], and Martello and Toth [16] provided
computational studies that compared their own shortest augmenting path codes
with codes based on the other ideas. McGinnes [17] programmed both primal and
dual codes, made computational studies, and concluded that dual codes were
superior. See also Hatch [12] and Glover and Klingman [10].

All of the computational experience discussed so far was obtained using
sequential computers to solve small dense, or larger sparse, problems. In the
last five years a number of hardware companies have built computers which have
a parallel architecture, that is, which consist of several small independent
computers (processors) that can be programmed to work simultaneously on
different parts of the same problem.

The purpose of the present paper is to discuss the implementation of a
primal transportation code on a specific parallel machine, a 14 processor BBN
Butterfly computer, and to give computational experience obtained from using
it to solve a variety of large, fully dense, randomly generated transportation
and assignment problems ranging in sizes up to m = n = 3000. Our conclusions
are:

(a)  The primal transportation algorithm is well suited for
     implementation on the Butterfly computer because of the way the
     memory is distributed among the processors, which makes the search
     step efficient to parallelize.

(b)  The pivot step of the primal code is not amenable to
     parallelization.

(c)  The search for pivot candidates becomes the dominant activity of the
     primal algorithm as problem size increases.

(d)  Speedup increases as the problem size increases.

(e)  Parallel architecture permits the use of coding strategies that are
     difficult or not profitable to implement on sequential computers.

(f)  The empirical difficulty of solving an n x n transportation or
     assignment problem using the parallel code is proportional to
     $n^{\alpha}$, where $\alpha$ varies between 2.0 and 2.2.


2.  THE PRIMAL ALGORITHM


The transportation problem is to ship at minimum cost a homogeneous good
from a set of m warehouses to a set of n markets. Let $x_{ij}$ be the amount
of the good shipped from warehouse i to market j, let $c_{ij}$ be the unit
cost of this shipment, let $a_i$ be the availability (supply) of the good at
warehouse i and let $b_j$ be the amount of the good required (demanded) at
market j. Assume $\sum_i a_i = \sum_j b_j$ so that there is a feasible
shipping pattern. The mathematical statement of the transportation problem
then is:

Minimize    $\sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij}$                (1)

Subject to  $\sum_{j=1}^{n} x_{ij} = a_i$      for i = 1,...,m          (2)

            $\sum_{i=1}^{m} x_{ij} = b_j$      for j = 1,...,n          (3)

            $x_{ij} \geq 0$                    for all i and j          (4)

Letting $u_i$ and $v_j$ be the dual variables associated with (2) and (3)
respectively, the dual problem can be stated as:

Maximize    $\sum_{i=1}^{m} u_i a_i + \sum_{j=1}^{n} v_j b_j$            (5)

Subject to  $u_i + v_j \leq c_{ij}$            for all i and j          (6)


The following facts are well known concerning these two problems:

(1)  The problem has the natural integer property: that is, if $a_i$ and
     $b_j$ are positive integers for all i and j, then every feasible
     basic solution satisfying conditions (2), (3), and (4) gives $x_{ij}$
     equal to a nonnegative integer.

(2)  In any basic solution, at most m+n-1 values of $x_{ij}$ are positive
     and exactly m+n-1 variables $x_{ij}$ are basic.

(3)  Let $V = \{R_1,...,R_m, C_1,...,C_n\}$ be the set of rows and columns
     of the problem, let $x_{ij}$ be a feasible solution and let
     $B = \{(i,j) \mid x_{ij} \text{ is basic}\}$; then the graph G = {V,B}
     is a connected acyclic graph called the basis tree.

(4)  There are well known procedures such as northwest corner, VAM,
     matrix minimum, modified row minimum, etc. (see [22]) for finding an
     initial basic primal feasible solution to (2), (3), and (4).

(5)  Given a primal feasible basis tree, any node (row or column) can be
     denoted as the root node and the tree can be rehung so that the
     root node is at the top.

(6)  The nodes of degree one in a basis tree are called pendant nodes.
     Starting from each pendant node and working upward to the root node
     it is possible to determine the shipping amounts $x_{ij}$. Similarly,
     starting at the root node and working downwards it is possible to
     determine the optimal dual variables $u_i$ and $v_j$ associated with
     nodes $R_i$ and $C_j$, respectively.

(7)  If k is any (row or column) node, not the root node r, of a basis
     tree, then there is a unique path in the tree from k to r. The
     first encountered node on this path is the predecessor p(k) of k,
     and the number of arcs on the path is the distance d(k) of k from r.
     Note that r has no predecessor and we define d(r) = 0.

(8)  If $x_{ij}$ is basic then (i,j) is an arc in the basis tree so that
     either (a) $p(R_i) = C_j$ or (b) $p(C_j) = R_i$. In case (a) we let
     $x(R_i) = x_{ij}$ be the shipping amount and in case (b) we let
     $x(C_j) = x_{ij}$.

(9)  If h and k are distinct nodes of the basis tree and k is on
     the path between h and the root node r then node h is a
     descendant of k. The basis tree is said to be stored in a recursive
     list if it has the following property: if node h is stored after
     node k on the list and h is not a descendant of k, then all
     descendants of k are stored on the list between k and h. The
     descendants of a node in a recursive list can be searched by calling
     a recursive subroutine.

(10) The dual variables are stored as node lists where $u(R_i) = u_i$ and
     $v(C_j) = v_j$. (If k is any node then exactly one of u(k) or v(k)
     is defined depending on whether k is a row or column node.)
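
Facts (2), (4) and (6) can be illustrated on a small instance. The sketch
below is not the authors' code: it assumes a balanced 3 x 3 problem with
made-up supplies, demands and costs, uses the northwest corner rule as the
starting procedure, and takes row 0 as the root with u = 0 there. The
function names are hypothetical.

```python
from collections import deque

def northwest_corner(supply, demand):
    """Return {(i, j): amount} for a basic feasible solution (balanced data)."""
    a, b = list(supply), list(demand)
    m, n = len(a), len(b)
    i = j = 0
    basis = {}
    while True:
        q = min(a[i], b[j])
        basis[(i, j)] = q          # may be 0 when the basis is degenerate
        a[i] -= q
        b[j] -= q
        if i == m - 1 and j == n - 1:
            break
        if a[i] == 0 and i < m - 1:
            i += 1                 # row exhausted: move down
        else:
            j += 1                 # column satisfied: move right
    return basis

def duals_from_basis(cost, basis, m, n):
    """Start with u[0] = 0 at the root and work down using u_i + v_j = c_ij."""
    u, v = [None] * m, [None] * n
    u[0] = 0
    queue = deque([("R", 0)])
    while queue:
        kind, k = queue.popleft()
        for (i, j) in basis:
            if kind == "R" and i == k and v[j] is None:
                v[j] = cost[i][j] - u[i]
                queue.append(("C", j))
            elif kind == "C" and j == k and u[i] is None:
                u[i] = cost[i][j] - v[j]
                queue.append(("R", i))
    return u, v

# Illustrative data only (not taken from the paper).
supply, demand = [20, 30, 25], [10, 25, 40]
cost = [[8, 6, 10], [9, 12, 13], [14, 9, 16]]
basis = northwest_corner(supply, demand)
u, v = duals_from_basis(cost, basis, 3, 3)
reduced = [[cost[i][j] - u[i] - v[j] for j in range(3)] for i in range(3)]
```

The basis has m+n-1 = 5 cells, and any negative entry of `reduced` marks a
nonbasic cell that violates (6) and is therefore a pivot candidate.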

Figure 1 gives the flow diagram of the primal transportation code. It
was implemented by using the five node functions p(k), d(k), x(k), u(k),
v(k), and by storing the basis tree in a recursive list. The sequential
implementation of the primal code is discussed in more detail in [11], [14],
and [21].
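
As a rough illustration of the predecessor and distance functions and of the
recursive list, the sketch below hangs a small basis tree from a chosen root
and stores the nodes in depth-first (preorder) order, which is one ordering
that satisfies the recursive-list property. The node labels, the arc list
and the function names are assumptions made for the example; the cited
implementations may store the tree differently.

```python
def hang_tree(arcs, root):
    """Return predecessor p(k), distance d(k) and a preorder node list."""
    adj = {}
    for a, b in arcs:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    pred, dist, order = {root: None}, {root: 0}, []
    stack = [root]
    while stack:
        k = stack.pop()
        order.append(k)
        for h in reversed(adj[k]):     # reversed so children pop in arc order
            if h not in dist:
                pred[h] = k
                dist[h] = dist[k] + 1
                stack.append(h)
    return pred, dist, order

def descendants(k, dist, order):
    """Descendants of k are the nodes stored right after k with larger d."""
    out = []
    for h in order[order.index(k) + 1:]:
        if dist[h] <= dist[k]:
            break
        out.append(h)
    return out

# Hypothetical basis tree on rows R0..R2 and columns C0..C2.
arcs = [("R0", "C0"), ("R0", "C1"), ("R1", "C1"), ("R1", "C2"), ("R2", "C2")]
pred, dist, order = hang_tree(arcs, "R0")
```

Because the descendants of a node occupy a contiguous stretch of the list,
`descendants` only needs the distance function to find where they end.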

Figure 1.  Flow diagram for the sequential primal transportation algorithm.


3.  THE PARALLEL PRIMAL ALGORITHM

The efficient implementation of any algorithm on a parallel computer
necessarily depends on the exact architecture of that computer as well as the
characteristics of the algorithm. We discuss the implementation of the primal
algorithm of the previous section on a BBN Butterfly Plus computer. The
Butterfly Plus is a tightly coupled heterogeneous, shared memory
multiprocessor consisting of several Motorola 68020/68881 processors, each
having 4 megabytes of its own local memory and accessing the memories of the
other processors through a packet switched network. Our code was implemented
on a 14 processor machine. Discussions of the implementation on the same
parallel computer of a shortest augmenting path algorithm for assignment and
travelling salesman problems can be found in Miller and Pekny [18] and Pekny
and Miller [19].

Because the Butterfly computer has its memory distributed over the 14
processors it was necessary to distribute the m x n cost matrix C so that
each processor stored several contiguous rows of C in its local memory.
Given this row distribution, it is clear from Figure 1 that steps A and C can
readily be done in parallel. The time to generate the problem data, step A,
was not counted in the computation time and so will not be discussed further.
The search subroutine, step C, was performed in parallel and was, in fact, the
only major subroutine that we were able to perform in parallel in this version
of the code.

Subroutine B, build initial tree, was carried out sequentially because
(a) it is called only once and (b) it requires only 5 to 10 percent of the
total computation time. Computational experiments with parallel starting tree
bases will be reported on elsewhere.


A.  GENERATE PROBLEM

B.  BUILD INITIAL TREE

C.  SEARCH
    Each of the p processors searches the next row it owns for the most
    negative reduced cost; if one is found it is put in the problem queue.

D.  SYNCHRONIZE
    After finishing row searches, processors wait until all others finish.

E.  Is pivot queue list empty?  (yes: go to H; no: go to F)

F.  PIVOT
    Each processor carries out each of the pivots on its own copy of the
    basis tree.

G.  SYNCHRONIZE
    After finishing pivoting, processors wait until all others finish;
    control then returns to C.

H.  Have all rows in the entire matrix been searched without finding a
    pivot?  (yes: go to I; no: go to C)

I.  Print optimal solution.

Figure 2.  Flow diagram for the parallel primal transportation algorithm.

We now describe the parallel primal transportation algorithm shown in
Figure 2 on the assumption that p (> 1) processors are available when the
problem is being solved. Each processor stores the rows it owns in a
wrap-around list so that when it reaches the end of its list it goes back to
the beginning. It maintains a pointer to the current row that is next to be
searched; the pointer initially points to the first row it owns.

After the initial tree is built each processor obtains its own copy of
the lists defining the initial tree. Then subroutine C is invoked, and each
processor searches its current row for the entry having the most negative
reduced cost. If a processor finds such an entry, it writes the row and
column in which it was found, together with its reduced cost, in the pivot
queue list. The processor then updates its current row pointer and goes
to synchronization step D, where it waits until all processors have finished
their searches. This synchronization step is necessary so that the
pivot queue list is completely defined before pivot subroutine F is called.

Control then passes to step E in which the question is asked whether the
pivot queue list is empty. If the answer is yes control passes to step H in
which the question is asked whether all of the rows in the entire matrix have
been searched without finding a pivot. If the answer is yes, control is
passed to step I and the optimal solution is printed. An answer of no sends
the computer back to search step C.

If the answer to the question in E is no, then subroutine F
is called and all p processors simultaneously perform all the pivots listed
in the pivot queue on their own copies of the basis tree. The reason this is
done is that it is faster for each processor to carry out the calculation
than to have just one of them do the calculation and communicate the
result. As discussed in the next section, we also tried to break the pivoting
process down into small tasks and have individual processors carry out these
tasks. However, the size of the tasks, sometimes referred to as their
granularity, was found to be too small for a parallel processing strategy to
reduce the time below that of the strategy just described.

After each processor completes its pivot task, it moves to
synchronization step G where it waits until all processors have finished
pivoting. The reason for the synchronization step here is that it is
necessary to prevent the pivot queue list from being altered before one or
more of the processors have completed their pivot tasks.

Once all processors indicate in step G that they have finished pivoting,
control returns to search step C, completing the main computational loop.
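
One round of search step C can be sketched as follows. This is a sequential
simulation written for illustration, not the Butterfly code: the p
processors are visited one after another in a plain loop, returning from the
function plays the role of synchronization step D, and no pivots are
performed between rounds. The cost matrix, dual values and the names
`search_round`, `owned` and `pointer` are assumptions made for the example.

```python
def search_round(cost, u, v, owned, pointer):
    """Each simulated processor scans its current row for the most negative
    reduced cost c_ij - u_i - v_j and appends (row, col, value) to the queue."""
    queue = []
    for proc, rows in enumerate(owned):
        if not rows:
            continue
        i = rows[pointer[proc]]
        best_j, best = None, 0
        for j, c in enumerate(cost[i]):
            r = c - u[i] - v[j]
            if r < best:
                best_j, best = j, r
        if best_j is not None:
            queue.append((i, best_j, best))
        # Wrap-around list: after the last owned row, return to the first.
        pointer[proc] = (pointer[proc] + 1) % len(rows)
    return queue

# Illustrative data: two simulated processors owning contiguous rows.
cost = [[8, 6, 10], [9, 12, 13], [14, 9, 16]]
u, v = [0, 6, 9], [8, 6, 7]
owned = [[0, 1], [2]]
pointer = [0, 0]
round1 = search_round(cost, u, v, owned, pointer)
pointer_after_round1 = list(pointer)
round2 = search_round(cost, u, v, owned, pointer)
```

An empty queue from every processor over a full pass through all rows
corresponds to the optimality test in step H.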

4.  ATTEMPT  TO  PARALLELIZE  THE  PIVOT  STEP 

At the end of the search step each processor has located the most
negative element in its current row, or has determined that the row has no
negative elements. Hence the problem queue has up to 14 possible pivots that
can be performed. However it is not always possible to perform two (or more)
pivots simultaneously because the pivot operations may require changes to be
made on the basis functions of some of the same nodes.

To make these ideas precise let Q be the set of potential pivots, and
let k be an element of Q. Define T(k) to be the set of basis tree nodes
at which one of the five node functions p, d, x, u, v is changed when pivot
k is performed. Two pivots h and k are said to be independent if
$T(h) \cap T(k) = \emptyset$, and are said to be dependent if
$T(h) \cap T(k) \neq \emptyset$. Consider the graph P = {Q,E}
whose nodes consist of the pivots k in the pivot queue, and whose edges
(h,k) belong to E if pivots h and k are dependent. We would like to
find a maximum independent subset S in P, which would then consist of a set
of pivots which can be done in parallel. Since finding a maximum independent
subset of a graph is a well known NP-hard problem, we used the following
heuristic program:

(1)  Let $S = \emptyset$.

(2)  Find a node k of largest degree in P.

(3)  Remove k from Q; put k in S.

(4)  For each edge (k,h) in E, remove h from Q, and remove (k,h) from E.

(5)  If $Q \neq \emptyset$ go to (2), else go to (6).

(6)  Stop. Set S consists of independent pivots.

Since at least one node is removed from Q at each step of the algorithm, it
runs very quickly. The set S will not necessarily be a maximum
independent subset of P, but will usually be quite good.
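
Steps (1) to (6) can be sketched in a few lines. The version below follows
the steps as stated, with two assumptions that are not in the text: ties in
step (2) are broken by sorted order, and edges that touch an already removed
node are dropped so that degrees refer to the remaining graph. The function
name and the small dependency graph are made up for the example.

```python
def independent_pivots(Q, E):
    """Greedy heuristic of steps (1)-(6); returns the list S of chosen pivots."""
    remaining = set(Q)
    edges = {frozenset(e) for e in E}
    S = []                                          # step (1)
    while remaining:                                # step (5)
        # Step (2): node of largest degree (assumed tie-break: sorted order).
        k = max(sorted(remaining),
                key=lambda node: sum(1 for e in edges if node in e))
        remaining.discard(k)                        # step (3)
        S.append(k)
        for e in [e for e in edges if k in e]:      # step (4)
            (h,) = e - {k}
            remaining.discard(h)
            edges.discard(e)
        # Assumption: edges touching a removed node no longer count.
        edges = {e for e in edges if e <= remaining}
    return S                                        # step (6)

# Hypothetical pivot queue with three dependent pairs.
Q = ["a", "b", "c", "d", "e"]
E = [("a", "b"), ("a", "c"), ("d", "e")]
S = independent_pivots(Q, E)
```

Building E is the expensive part in practice, since it requires T(k) for
every queued pivot; the sketch takes E as given.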

We implemented the above procedure but found that it was not
computationally as efficient as the idea described in the previous section.
The main difficulty is that the computational effort of determining for each
k in Q the set T(k) is essentially the same as actually performing the
pivot on k. To that time the work of determining the independent set S and
actually carrying out the independent pivots in S must be added. Overall
the computation time was increased by the attempt to parallelize the pivot
step, so it was abandoned.

5.  COMPUTATIONAL  RESULTS 

In the course of our computational experiments we solved more than 500
randomly generated fully dense transportation problems on the Butterfly
computer. The averaged data is presented graphically in Figures 1-5. The
raw computational data appears in the appendix.

Figure 1 shows the performance of the parallel primal algorithm on square
transportation problems for sizes n = 500, 1000, ..., 3000, and for shipping
amounts ranging from 1 (assignment problems) to 1000. As noted in [22] for
much smaller problem sizes, the execution times increase with problem size and
with shipping amounts. We fitted these data points with a power
function of the form $K n^{\alpha}$ and found that the exponent $\alpha$
ranged from $\alpha$ = 2.0 for assignment problems to $\alpha$ = 2.2 for
transportation problems having shipping amounts in the range [1,1000]. The
correlation coefficients for all of these were greater than .999, indicating
an extremely good fit.
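
The fit can be reproduced approximately from the averaged run times listed
in Table 1 of the appendix. The sketch below assumes an ordinary
least-squares line through the points (log n, log t), which may differ from
the fitting procedure the authors used; `fit_power_law` is a hypothetical
helper name.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of log t = log K + alpha * log n; returns (K, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    return math.exp(my - alpha * mx), alpha

# Average run times from Table 1 (14 processors, costs in [0,1000]).
sizes = [500, 1000, 1500, 2000, 2500, 3000]
assignment = [22.3, 86.3, 203.4, 370.4, 578.1, 801.4]      # Ship [1]
ship_1000 = [38.5, 168.4, 399.5, 784.2, 1279.5, 1937.1]    # Ship [1,1000]
_, alpha_assign = fit_power_law(sizes, assignment)
_, alpha_ship = fit_power_law(sizes, ship_1000)
```

Under these assumptions the two exponents come out close to the reported
values of 2.0 and 2.2.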

Figure 2 shows the effect of varying cost ranges and problem size on
execution times for assignment problems. The larger the cost range the less
dual degeneracy the problem has. To put it differently, the smaller the
cost range is, the larger is the number of alternative optimal solutions. The
primal transportation algorithm is able to take advantage of dual degeneracy,
and is able to solve problems having smaller cost ranges much faster than
those having larger cost ranges. In contrast, some dual methods find the
opposite phenomenon to be the case; see [19].

In order to determine the speedup factor of the parallel primal algorithm
we solved assignment problems with n = 500, 750, and 1000 on the Butterfly
computer with just one processor (no larger problems could be solved on a
single processor) and also solved the same problems with 2, 4, 6, 7, 8, 10,
12, and 14 processors. The averages of the execution times of these runs are
plotted in Figure 3. For problems having the same cost range the speedup
factor is the ratio of the execution time on one processor divided by the
smallest execution time achieved by using any number of processors. Note that
the speedup factor increases with problem size, being about 2.2 for n = 500
and increasing to about 2.5 for n = 1000. Note also that the most
effective number of processors seems to be in the range 6 to 8 for this range
of problem sizes. As problems get larger, the primal algorithm should be able
to use more processors effectively to improve the efficiency of the search
part of the algorithm, as will become evident in the next two figures.

Since we were able to parallelize the search phase of the algorithm but
not the pivot phase, it is important to find out whether the percentage of
time spent in the search phase increased with problem size. We made runs on
the sequential version of the algorithm using a SUN work station. The average
percentage of search time for three assignment problems at each size, with
sizes ranging from n = 100 to n = 2500 and for two cost ranges [0,1000] and
[0,10000], is shown in the bar graph of Figure 4. The search percentage
ranged from about 55% for n = 100 to about 95% for n = 2500. The fact that we
parallelized the part of the sequential algorithm whose share of the running
time increases with n was the main reason for the good performance
of the parallel primal algorithm.

In order to investigate further the speedup of the parallel versus the
sequential code we made use of a speedup formula due to Rettberg and Thomas
[20]. Let T be the total execution time on one processor, let f be the
fraction of time spent in the search phase of the algorithm, let p be the
number of processors, and let S be the speedup factor. Then speedup, S, is

    $S = \frac{T}{(1-f)T + fT/p} = \frac{p}{(1-f)p + f}$

In Figure 5 we have plotted this curve for problems having costs in the range
[0,1000] with f = .83 as can be observed in Figure 4. We also computed the
observed speedup values by dividing the execution time in Figure 3 for each
value of p into its value for p = 1. These points are also plotted in Figure
5. The maximum speedup observed when p = 8 was 70 percent of its
theoretical maximum value. We were unable to continue this study for larger
values of p because of the memory limitation of a single processor of the
Butterfly computer.
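
The 70 percent figure can be checked against the appendix data. The sketch
below evaluates the formula with f = .83 and compares it with the observed
speedups, assuming that the n = 1000 column of Table 3 is the one plotted in
Figure 5 (Table 4 gives 82.8 percent search time at n = 1000 for costs in
[0,1000], which is consistent with f = .83). The function name is a
hypothetical one chosen for the example.

```python
def predicted_speedup(f, p):
    """S = p / ((1 - f) p + f), with f the fraction of time that parallelizes."""
    return p / ((1.0 - f) * p + f)

# Table 3 execution times for n = 1000, keyed by number of processors.
times_n1000 = {1: 199.3, 2: 138.3, 4: 102.2, 6: 86.5, 7: 86.3,
               8: 77.6, 10: 81.8, 12: 86.8, 14: 83.9}

f = 0.83
observed = {p: times_n1000[1] / t for p, t in times_n1000.items()}
best_p = max(observed, key=observed.get)           # processor count with best speedup
ratio = observed[best_p] / predicted_speedup(f, best_p)
print(best_p, round(observed[best_p], 2),
      round(predicted_speedup(f, best_p), 2), round(ratio, 2))
```

With these inputs the best observed speedup occurs at p = 8 and is roughly
0.70 of the value the formula predicts for 8 processors.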

6.  CONCLUSIONS 

This paper is a continuation of the experimental investigations begun in
the 1970's on primal simplex transportation codes. It presents evidence that
the primal code is well suited for parallel computation, at least with the
computer architecture of the Butterfly computer. Significant speedup factors
for parallel over sequential machines were achieved, which enabled the
parallel computer to solve fully dense transportation and assignment problems
which are at least one order of magnitude larger than those previously
reported.

Figure 1.  The effect on execution time of varying shipping amounts for
           square dense transportation problems. The cost range was
           [0,1000]. The lowest curve gives solution times for assignment
           problems. Fourteen parallel processors were used.

Figure 2.  The effect on execution times of varying cost ranges for square
           assignment problems. Fourteen parallel processors were used.

Figure 3.  The effect on execution times of changing the number of parallel
           processors for three different assignment problem sizes. For
           each curve the maximum speedup factor is the ratio of the time
           for one processor divided by the smallest time for any number of
           processors.

Figure 5.  Theoretical maximum speedup and actual speedup as a function of
           the number of processors in solving assignment problems with
           cost ranges [0,1000]. The maximum speedup, achieved with 8
           processors, was approximately 70 percent of the theoretical
           maximum.

Appendix

The following four tables contain the run time information presented in
Figures 1-5, and also additional information concerning the dispersion of run
times, build initial tree time, search and pivot time, number of rows
searched, number of pivots, and number of pivots per search. This information
may be useful to others who are implementing the primal or other
transportation algorithms on parallel computers.

Ship [1]

runs  size  run time      run time       build tree  search and  rows      number   pivots
            avg/std       min/max        time        pivot time  searched  pivots   /search
10    500   22.3/1.3      20.5/24.7      5.3         16.3        19961     9601     6.74
10    1000  86.3/5.3      78.0/93.5      20.9        64.2        59424     30536    7.21
10    1500  203.4/12.1    188.5/229.1    47.8        153.6       118073    60914    7.23
10    2000  370.4/16.7    346.8/402.4    82.0        285.8       192375    103853   7.56
10    2500  578.1/19.1    556.0/610.0    130.2       444.8       254412    145625   8.02
10    3000  801.4/32.5    753.1/854.3    181.8       615.6       309771    187116   8.46

Ship [1,10]

runs  size  run time      run time       build tree  search and  rows      number   pivots
            avg/std       min/max        time        pivot time  searched  pivots   /search
10    500   25.8/2.1      23.5/29.2      4.6         20.6        15187     5954     5.47
10    1000  99.9/13.6     87.9/134.4     17.8        80.9        40616     16829    5.75
10    1500  223.6/20.6    196.6/270.8    39.2        182.4       71565     30166    5.89
10    2000  405.4/35.4    375.0/485.0    71.4        331.3       109752    46122    5.87
10    2500  653.4/34.5    609.3/707.0    111.8       538.5       156575    66126    5.92
9     3000  931.2/85.9    849.2/1143.5   158.5       768.7       194868    85376    6.15

Ship [1,100]

runs  size  run time      run time       build tree  search and  rows      number   pivots
            avg/std       min/max        time        pivot time  searched  pivots   /search
10    500   56.3/66.1     32.0/244.4     4.5         51.2        14978     5148     4.80
10    1000  142.9/27.0    125.8/218.8    17.5        124.2       37595     13044    4.82
10    1500  337.9/43.2    308.3/447.2    40.0        296.1       64167     22166    4.84
10    2000  635.5/84.5    577.8/843.6    69.0        564.0       98075     33001    4.68
10    2500  1025.8/94.2   947.7/1250.0   109.0       913.7       124921    44339    4.96
10    3000  1522.7/179.5  1404.3/1948.3  155.8       1363.2      161806    56584    4.90

Ship [1,1000]

runs  size  run time      run time       build tree  search and  rows      number   pivots
            avg/std       min/max        time        pivot time  searched  pivots   /search
10    500   38.5/3.8      33.8/45.2      4.5         33.4        14683     5154     4.90
10    1000  168.4/33.9    146.3/261.3    17.5        149.6       39022     12891    4.60
10    1500  399.5/48.3    351.5/524.6    38.8        358.8       64849     21729    4.69
10    2000  784.2/97.0    696.0/1027.6   70.1        711.3       101129    32223    4.45
10    2500  1279.5/110.8  1185.2/1525.9  110.2       1166.1      128667    42681    4.66
8     3000  1937.1/262.6  1795.2/2578.5  156.7       1776.3      163039    54173    4.67

Table 1.  Raw data for the graphs in Figure 1. All problems
          were randomly generated with costs in the range
          [0,1000]. All problems were solved using a 14
          processor Butterfly computer.

runs  size  run time      run time       build tree  search and  rows      number   pivots
            avg/std       min/max        time        pivot time  searched  pivots   /search
10    500   13.9/0.8      12.5/15.5      7.2         5.9         7996      3880     6.81
10    1000  41.1/1.3      38.9/42.9      28.0        11.7        9805      5124     7.33
10    1500  79.3/3.1      74.1/83.2      60.7        16.4        11725     5871     7.02
10    2000  135.4/5.9     125.6/142.3    107.0       25.4        13386     6470     6.78
10    2500  200.3/4.0     195.5/207.5    168.5       28.0        15190     7222     6.66
10    3000  281.9/10.6    264.9/301.9    243.7       33.7        17133     7844     6.42

Cost [0,1000]

runs  size  run time      run time       build tree  search and  rows      number   pivots
            avg/std       min/max        time        pivot time  searched  pivots   /search
10    500   22.3/1.3      20.5/24.7      5.3         16.3        19961     9601     6.74
10    1000  86.3/5.3      78.0/93.5      20.9        64.2        59424     30536    7.21
10    1500  203.4/12.1    188.5/229.1    47.8        153.6       118073    60914    7.23
10    2000  370.4/16.7    346.8/402.4    82.0        285.8       192375    103853   7.56
10    2500  578.1/19.1    556.0/610.0    130.2       444.8       254412    145625   8.02
10    3000  801.4/32.5    753.1/854.3    181.8       615.6       309771    187116   8.46

Cost [0,10000]

runs  size  run time      run time       build tree  search and  rows      number   pivots
            avg/std       min/max        time        pivot time  searched  pivots   /search
10    500   23.6/1.6      19.6/25.2      5.3         17.7        22605     10718    6.64
10    1000  98.4/6.7      89.3/111.4     21.4        75.7        73509     38064    7.26
10    1500  237.9/10.8    219.6/257.5    46.6        189.3       157858    84369    7.50
10    2000  449.8/29.3    422.3/504.4    82.7        364.4       265005    146606   7.75
10    2500  748.7/41.0    668.8/819.5    135.7       609.8       402653    221668   7.71
10    3000  1136.4/60.0   1001.1/1220.7  185.2       947.2       567511    324035   8.00

Table 2.  Raw data for Figure 2. All the problems were
          randomly generated assignment problems of
          stated sizes and costs chosen in the given
          ranges. All were solved using a 14 processor
          Butterfly computer.

Execution Time vs. Processors

processors  n = 500  n = 750  n = 1000
1           45.7     104.3    199.3
2           32.0     73.7     138.3
4           23.5     52.3     102.2
6           22.23    49.2     86.5
7           22.3     -        86.3
8           21.6     47.7     77.6
10          21.0     48.3     81.8
12          21.0     48.7     86.8
14          21.0     49.3     83.9

Table 3.  Data used for Figures 3 and 5. All
          problems were randomly generated
          assignment problems with costs in
          the range [0,1000].

Search Time as a Percentage
of Execution Time

n      Cost [0,1000]  Cost [0,10000]
100    56.4           54.1
250    61.7           63.9
500    78.1           78.6
1000   82.8           94.2
1500   94.2           95.5
2000   95.0           96.0
2500   94.8           96.4

Table 4.  Data for the graph in Figure 4. The
          numbers are averages for the solution
          of three randomly generated assignment
          problems for various problem sizes and
          two different cost ranges. All problems
          were solved by a sequential version of
          the primal code on a SUN work station.